NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models

https://doi.org/10.1609/aaai.v39i2.32171

Arif, Kazi_Hasan Ibn; Yoon, JinYi; Nikolopoulos, Dimitrios S; Vandierendonck, Hans; John, Deepu; Ji, Bo (April 2025, Proceedings of the AAAI Conference on Artificial Intelligence)

High-resolution Vision-Language Models (VLMs) are widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate an excessive number of visual tokens due to the need to encode multiple partitions of a high-resolution image input. Processing such a large number of visual tokens poses significant computational challenges, particularly for resource-constrained commodity GPUs. To address this challenge, we propose High-Resolution Early Dropping (HiRED), a plug-and-play token-dropping method designed to operate within a fixed token budget. HiRED leverages the attention of CLS token in the vision transformer (ViT) to assess the visual content of the image partitions and allocate an optimal token budget for each partition accordingly. The most informative visual tokens from each partition within the allocated budget are then selected and passed to the subsequent Large Language Model (LLM). We showed that HiRED achieves superior accuracy and performance, compared to existing token-dropping methods. Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7x increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving throughput and latency benefits.
more » « less
Free, publicly-accessible full text available April 11, 2026
FrameFeedback: A Closed-Loop Control System for Dynamic Offloading Real-Time Edge Inference

https://doi.org/10.1109/IPDPSW63119.2024.00116

Jackson, Matthew; Ji, Bo; Nikolopoulos, Dimitrios S (May 2024, IEEE)

Full Text Available
Application-Attuned Memory Management for Containerized HPC Workflows

Arif, Moiz; Maurya, Avinash; Rafique, M. Mustafa; Nikolopoulos, Dimitrios S.; Butt, Ali R. (May 2024, IEEE International Parallel & Distributed Processing Symposium (IPDPS))

Full Text Available
Application-Attuned Memory Management for Containerized HPC Workflows

https://doi.org/10.1109/IPDPS57955.2024.00019

Arif, Moiz; Maurya, Avinash; Rafique, M Mustafa; Nikolopoulos, Dimitrios S; Butt, Ali R (May 2024, IEEE)

Full Text Available
Application-Attuned Memory Management for Containerized HPC Workflows

Arif, Moiz; Maurya, Avinash; Rafique, M. Mustafa; Nikolopoulos, Dimitrios S.; Butt. Ali R. (May 2024, Proceedings of the 38th IEEE International Parallel & Distributed Processing Symposium (IPDPS))

Full Text Available
On Robust Optimal Joint Deployment and Assignment of RAN Intelligent Controllers in O-RANs

https://doi.org/10.1109/OJCOMS.2024.3383607

Abdel-Rahman, Mohammad J; Mazied, Emadeldin A; Hassan, Fahid; Teague, Kory; Mackenzie, Allen B; Midkiff, Scott F; Cardoso, Kleber V; Nikolopoulos, Dimitrios S (January 2024, IEEE Open Journal of the Communications Society)

Full Text Available
Auto-scaling edge cloud for network slicing

https://doi.org/10.3389/fhpcp.2023.1167162

Mazied, EmadElDin A; Nikolopoulos, Dimitrios S; Hanafy, Yasser; Midkiff, Scott F (June 2023, Frontiers in High Performance Computing)

This paper presents a study on resource control for autoscaling virtual radio access networks (RAN slices) in next-generation wireless networks. The dynamic instantiation and termination of on-demand RAN slices require efficient autoscaling of computational resources at the edge. Autoscaling involves vertical scaling (VS) and horizontal scaling (HS) to adapt resource allocation based on demand variations. However, the strict processing time requirements for RAN slices pose challenges when instantiating new containers. To address this issue, we propose removing resource limits from slice configuration and leveraging the decision-making capabilities of a centralized slicing controller. We introduce a resource control agent (RC) that determines resource limits as the number of computing resources packed into containers, aiming to minimize deployment costs while maintaining processing time below a threshold. The RAN slicing workload is modeled using the Low-Density Parity Check (LDPC) decoding algorithm, known for its stochastic demands. We formulate the problem as a variant of the stochastic bin packing problem (SBPP) to satisfy the random variations in radio workload. By employing chance-constrained programming, we approach the SBPP resource control (S-RC) problem. Our numerical evaluation demonstrates that S-RC maintains the processing time requirement with a higher probability compared to configuring RAN slices with predefined limits, although it introduces a 45% overall average cost overhead.
more » « less
Full Text Available
Decentralised Biomedical Signal Classification using Early Exits

https://doi.org/10.1109/NEWCAS57931.2023.10198098

Xiaolin, Li; Vandierendonck, Hans; Nikolopoulos, Dimitrios S; Ji, Bo; Cardiff, Barry; John, Deepu (June 2023, IEEE)

Full Text Available
Energy-efficient localised rollback via data flow analysis and frequency scaling

https://doi.org/10.1145/3236367.3236379

Dichev, Kiril; Cameron, Kirk; Nikolopoulos, Dimitrios S. (September 2018, EuroMPI'18: Proceedings of the 25th European MPI Users' Group Meeting)

Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpoint and a global rollback to recover. In recent years, techniques reducing the number of rolling back processes have been implemented via message logging. However, the log-based approaches have weaknesses, such as being dependent on complex modifications within an MPI implementation, and the fact that a full restart may be required in the general case. To address the limitations of all log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data flow rollback (DFR), a fundamentally different approach relying on analysis of the data flow of iterative codes, and the well-known concept of data flow graphs. We demonstrate the benefits of DFR for an MPI stencil code by localising rollback, and then reduce energy consumption by 10-12% on idling nodes via frequency scaling. We also provide large-scale estimates for the energy savings of DFR compared to global rollback, which for stencil codes increase as n2 for a process count n.
more » « less
Full Text Available

Search for: All records